Section 20

Kantorovich-Rubinstein Theorem. 



Let $(S,d)$ be a separable metric space. Denote by $\mathcal{P}_1(S)$ the set of all laws $\mathbb{P}$ on $S$ such that for some $z \in S$ (equivalently, for all $z \in S$),

$$\int_S d(x,z)\, d\mathbb{P}(x) < \infty.$$



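Let us denote by

$$M(\mathbb{P},\mathbb{Q}) = \bigl\{ \mu : \mu \text{ is a law on } S \times S \text{ with marginals } \mathbb{P} \text{ and } \mathbb{Q} \bigr\}.$$

Definition. For $\mathbb{P},\mathbb{Q} \in \mathcal{P}_1(S)$, the quantity

$$W(\mathbb{P},\mathbb{Q}) = \inf\Bigl\{ \int d(x,y)\, d\mu(x,y) : \mu \in M(\mathbb{P},\mathbb{Q}) \Bigr\}$$

is called the Wasserstein distance between $\mathbb{P}$ and $\mathbb{Q}$.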

A measure $\mu \in M(\mathbb{P},\mathbb{Q})$ represents a transportation between the measures $\mathbb{P}$ and $\mathbb{Q}$. We can think of the conditional distribution $\mu(y|x)$ as a way to redistribute the mass in the neighborhood of a point $x$ so that the distribution $\mathbb{P}$ will be redistributed to the distribution $\mathbb{Q}$. If the distance $d(x,y)$ represents the cost of moving $x$ to $y$ then the Wasserstein distance gives the optimal total cost of transporting $\mathbb{P}$ to $\mathbb{Q}$.
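The transportation picture can be made concrete on the smallest nontrivial example. The sketch below assumes the two-point space $S = \{0, 1\}$ with $d(x,y) = |x-y|$ (an illustrative choice, not part of the notes), where every law in $M(\mathbb{P},\mathbb{Q})$ is determined by the single number $t = \mu(\{(0,0)\})$. The names `coupling`, `cost` and `wasserstein_two_point` are hypothetical helpers.

```python
# Two-point space S = {0, 1} with d(x, y) = |x - y| (an illustrative assumption).
# P puts mass p at 0 and 1 - p at 1; Q puts mass q at 0 and 1 - q at 1.

def coupling(p, q, t):
    """Law on S x S with marginals P and Q, parametrized by t = mu({(0, 0)})."""
    return {(0, 0): t, (0, 1): p - t, (1, 0): q - t, (1, 1): 1 - p - q + t}

def cost(mu):
    """Transportation cost: the integral of d(x, y) with respect to mu."""
    return sum(mass * abs(x - y) for (x, y), mass in mu.items())

def wasserstein_two_point(p, q, steps=1000):
    """Grid search over the values of t that keep all four masses nonnegative."""
    lo, hi = max(0.0, p + q - 1.0), min(p, q)
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(cost(coupling(p, q, t)) for t in grid)

w = wasserstein_two_point(0.7, 0.4)
print(w, abs(0.7 - 0.4))   # the two numbers agree up to rounding
```

In this toy case the cost is the off-diagonal mass $p + q - 2t$, so the infimum is attained by keeping as much mass in place as possible, $t = \min(p,q)$.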
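Given any two laws $\mathbb{P}$ and $\mathbb{Q}$ on $S$, let us define

$$\gamma(\mathbb{P},\mathbb{Q}) = \sup\Bigl\{ \Bigl| \int f\, d\mathbb{P} - \int f\, d\mathbb{Q} \Bigr| : \|f\|_L \le 1 \Bigr\}$$

and

$$m_d(\mathbb{P},\mathbb{Q}) = \sup\Bigl\{ \int f\, d\mathbb{P} + \int g\, d\mathbb{Q} : f,g \in C(S),\ f(x)+g(y) < d(x,y) \Bigr\}.$$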

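Lemma 40 We have $\gamma(\mathbb{P},\mathbb{Q}) = m_d(\mathbb{P},\mathbb{Q})$.

Proof. Given a function $f$ such that $\|f\|_L \le 1$ let us take a small $\varepsilon > 0$ and $g(y) = -f(y) - \varepsilon$. Then

$$f(x) + g(y) = f(x) - f(y) - \varepsilon \le d(x,y) - \varepsilon < d(x,y)$$

and

$$\int f\, d\mathbb{P} + \int g\, d\mathbb{Q} = \int f\, d\mathbb{P} - \int f\, d\mathbb{Q} - \varepsilon.$$

Combining this with the analogous choice of the pair $-f(x)$ and $g(y) = f(y) - \varepsilon$ we get

$$\Bigl| \int f\, d\mathbb{P} - \int f\, d\mathbb{Q} \Bigr| \le \sup\Bigl\{ \int f\, d\mathbb{P} + \int g\, d\mathbb{Q} : f(x)+g(y) < d(x,y) \Bigr\} + \varepsilon$$

which, of course, proves that

$$\gamma(\mathbb{P},\mathbb{Q}) \le \sup\Bigl\{ \int f\, d\mathbb{P} + \int g\, d\mathbb{Q} : f(x)+g(y) < d(x,y) \Bigr\}.$$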

Let us now consider functions $f,g$ such that $f(x) + g(y) < d(x,y)$. Define

$$e(x) = \inf_y \bigl( d(x,y) - g(y) \bigr) = -\sup_y \bigl( g(y) - d(x,y) \bigr).$$

Clearly,

$$f(x) \le e(x) \le d(x,x) - g(x) = -g(x)$$

and, therefore,

$$\int f\, d\mathbb{P} + \int g\, d\mathbb{Q} \le \int e\, d\mathbb{P} - \int e\, d\mathbb{Q}.$$

Function $e$ satisfies

$$e(x) - e(x') = \sup_y \bigl( g(y) - d(x',y) \bigr) - \sup_y \bigl( g(y) - d(x,y) \bigr) \le \sup_y \bigl( d(x,y) - d(x',y) \bigr) \le d(x,x')$$

which means that $\|e\|_L \le 1$. This finishes the proof.

□ 
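The second half of the proof of Lemma 40 can be checked by hand on a finite metric space. The sketch below uses hypothetical data (four points of the real line and arbitrary integer values for $f$ and $g$); it computes $e(x) = \inf_y (d(x,y) - g(y))$ and tests the three properties used in the proof: the sandwich $f \le e \le -g$, the Lipschitz bound, and the inequality between the integrals.

```python
# Finite metric space inside the real line; all data below are hypothetical.
S = [0, 1, 3, 6]
d = lambda x, y: abs(x - y)

f = {0: 2, 1: -1, 3: 0, 6: 4}
g_raw = {0: 1, 1: 0, 3: -2, 6: 3}

# Shift g down so that f(x) + g(y) < d(x, y) holds for all x, y.
excess = max(f[x] + g_raw[y] - d(x, y) for x in S for y in S)
g = {y: g_raw[y] - excess - 1 for y in S}

# e(x) = inf_y (d(x, y) - g(y)), as in the proof of Lemma 40.
e = {x: min(d(x, y) - g[y] for y in S) for x in S}

sandwich = all(f[x] <= e[x] <= -g[x] for x in S)
lipschitz = all(abs(e[x] - e[y]) <= d(x, y) for x in S for y in S)

# Integrals against two laws on S (uniform P, and Q split between 0 and 6).
P = {0: 0.25, 1: 0.25, 3: 0.25, 6: 0.25}
Q = {0: 0.5, 1: 0.0, 3: 0.0, 6: 0.5}
integral = lambda h, law: sum(h[x] * law[x] for x in S)
lhs = integral(f, P) + integral(g, Q)
rhs = integral(e, P) - integral(e, Q)
print(sandwich, lipschitz, lhs <= rhs)
```

Replacing the pair $(f,g)$ by $(e,-e)$ never decreases the objective, which is why the supremum defining $m_d$ can be restricted to pairs of this form.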

We will need the following version of the Hahn-Banach theorem.

Theorem 48 (Hahn-Banach) Let $V$ be a normed vector space, $E$ a linear subspace of $V$ and $U$ an open convex set in $V$ such that $U \cap E \neq \emptyset$. If $r : E \to \mathbb{R}$ is a linear non-zero functional on $E$ then there exists a linear functional $\rho : V \to \mathbb{R}$ such that $\rho|_E = r$ and $\sup_U \rho(x) = \sup_{U \cap E} r(x)$.

Proof. Let $t = \sup\{ r(x) : x \in U \cap E \}$ and let $B = \{ x \in E : r(x) > t \}$. Since $B$ is convex and $U \cap B = \emptyset$, the Hahn-Banach separation theorem implies that there exists a linear functional $q : V \to \mathbb{R}$ such that $\sup_U q(x) \le \inf_B q(x)$. For any $x_0 \in U \cap E$ let $F = \{ x \in E : q(x) = q(x_0) \}$. Since $q(x_0) < \inf_B q(x)$, $F \cap B = \emptyset$. This means that the hyperplanes $\{ x \in E : q(x) = q(x_0) \}$ and $\{ x \in E : r(x) = t \}$ in the subspace $E$ are parallel and this implies that $q(x) = \alpha r(x)$ on $E$ for some $\alpha \neq 0$. Let $\rho = q/\alpha$. Then $r = \rho|_E$ and

$$\sup_U \rho(x) = \frac{1}{\alpha} \sup_U q(x) \le \frac{1}{\alpha} \inf_B q(x) = \inf_B r(x) = t = \sup_{U \cap E} r(x).$$

Since $r = \rho|_E$, this finishes the proof.

□ 

Theorem 49 If $S$ is a compact metric space then $W(\mathbb{P},\mathbb{Q}) = m_d(\mathbb{P},\mathbb{Q})$ for $\mathbb{P},\mathbb{Q} \in \mathcal{P}_1(S)$.

Proof. Consider a vector space $V = C(S \times S)$ equipped with the $\|\cdot\|_\infty$ norm and let

$$U = \{ f \in V : f(x,y) < d(x,y) \}.$$

Obviously, $U$ is convex and open because $S \times S$ is compact and any continuous function on a compact achieves its maximum. Consider a linear subspace $E$ of $V$ defined by

$$E = \{ \phi \in V : \phi(x,y) = f(x) + g(y) \}$$

so that

$$U \cap E = \{ f(x) + g(y) < d(x,y) \}.$$
Define a linear functional $r$ on $E$ by

$$r(\phi) = \int f\, d\mathbb{P} + \int g\, d\mathbb{Q} \quad \text{if } \phi = f(x) + g(y).$$

By the above Hahn-Banach theorem, $r$ can be extended to $\rho : V \to \mathbb{R}$ such that $\rho|_E = r$ and

$$\sup_U \rho(\phi) = \sup_{U \cap E} r(\phi) = m_d(\mathbb{P},\mathbb{Q}).$$

Let us look at the properties of this functional. First of all, if $a(x,y) \ge 0$ then $\rho(a) \ge 0$. Indeed, for any $c \ge 0$,

$$U \ni d(x,y) - c \cdot a(x,y) - \varepsilon < d(x,y)$$

and, therefore, for all $c \ge 0$,

$$\rho(d - ca - \varepsilon) = \rho(d) - c\rho(a) - \rho(\varepsilon) \le \sup_U \rho < \infty.$$

This can hold only if $\rho(a) \ge 0$. This implies that if $\phi_1 \le \phi_2$ then $\rho(\phi_1) \le \rho(\phi_2)$. For any function $\phi$, both $-\phi, \phi \le \|\phi\|_\infty \cdot 1$ and, by monotonicity of $\rho$,

$$|\rho(\phi)| \le \|\phi\|_\infty\, \rho(1) = \|\phi\|_\infty.$$

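Since $S \times S$ is compact and $\rho$ is a continuous functional on $(C(S \times S), \|\cdot\|_\infty)$, by the Riesz representation theorem there exists a unique measure $\mu$ on the Borel $\sigma$-algebra on $S \times S$ such that

$$\rho(f) = \int f(x,y)\, d\mu(x,y).$$

Since $\rho|_E = r$,

$$\int \bigl( f(x) + g(y) \bigr)\, d\mu(x,y) = \int f\, d\mathbb{P} + \int g\, d\mathbb{Q}$$

which implies that $\mu \in M(\mathbb{P},\mathbb{Q})$. We have

$$m_d(\mathbb{P},\mathbb{Q}) = \sup_U \rho(\phi) = \sup\Bigl\{ \int f(x,y)\, d\mu(x,y) : f(x,y) < d(x,y) \Bigr\} = \int d(x,y)\, d\mu(x,y) \ge W(\mathbb{P},\mathbb{Q}).$$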

The opposite inequality is easy because for any $f,g$ such that $f(x) + g(y) < d(x,y)$ and any $\nu \in M(\mathbb{P},\mathbb{Q})$,

$$\int f\, d\mathbb{P} + \int g\, d\mathbb{Q} = \int \bigl( f(x) + g(y) \bigr)\, d\nu(x,y) \le \int d(x,y)\, d\nu(x,y). \qquad (20.0.1)$$

This finishes the proof and, moreover, it shows that the infimum in the definition of $W$ is achieved on $\mu$.

□
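The duality $W = m_d$ of Theorem 49 can be observed numerically on a tiny compact space. The sketch below assumes the two-point space $S = \{0,1\}$ with $d(x,y) = |x-y|$ and hypothetical masses; it brute-forces the dual problem over a grid of values for $f$ and $g$, and the primal problem over a grid of couplings. The constraint is written with $\le$ instead of $<$, which does not change the supremum.

```python
from itertools import product

# Two-point space S = {0, 1}, d(x, y) = |x - y|; masses in units of 1/10.
P = (7, 3)    # P = (0.7, 0.3)
Q = (4, 6)    # Q = (0.4, 0.6)
d = lambda x, y: abs(x - y)

# Dual side: maximize int f dP + int g dQ subject to f(x) + g(y) <= d(x, y),
# with f, g taking values in {-1.0, -0.9, ..., 1.0} (stored as integers in tenths).
values = range(-10, 11)
best_dual = max(
    P[0] * f0 + P[1] * f1 + Q[0] * g0 + Q[1] * g1
    for f0, f1, g0, g1 in product(values, repeat=4)
    if all(f + g <= 10 * d(x, y)
           for x, f in enumerate((f0, f1))
           for y, g in enumerate((g0, g1)))
) / 100

# Primal side: couplings with mu({(0, 0)}) = t / 10; the cost is the off-diagonal mass.
best_primal = min(
    (P[0] - t) + (Q[0] - t)
    for t in range(max(0, P[0] + Q[0] - 10), min(P[0], Q[0]) + 1)
) / 10

print(best_dual, best_primal)
```

Every feasible pair $(f,g)$ gives a lower bound on every coupling cost, as in (20.0.1); the two searches meet at the same value because the grids happen to contain optimizers for this example.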

Remark. Notice that in the proof of this theorem we never used the fact that $d$ is a metric. The theorem holds for any $d \in C(S \times S)$ under the corresponding integrability assumptions. For example, one can consider loss functions of the type $d(x,y)^p$ for $p > 1$, which are not necessarily metrics. However, in Lemma 40, the fact that $d$ is a metric was essential.

Our next goal will be to show that $W = \gamma$ on separable and not necessarily compact metric spaces. We start with the following.

Lemma 41 If $(S,d)$ is a separable metric space then $W$ and $\gamma$ are metrics on $\mathcal{P}_1(S)$.

Proof. Since for a bounded Lipschitz metric $\beta$ we have $\beta(\mathbb{P},\mathbb{Q}) \le \gamma(\mathbb{P},\mathbb{Q})$, $\gamma$ is also a metric because if $\gamma(\mathbb{P},\mathbb{Q}) = 0$ then $\beta(\mathbb{P},\mathbb{Q}) = 0$ and, therefore, $\mathbb{P} = \mathbb{Q}$. As in (20.0.1), it should be obvious that $\gamma(\mathbb{P},\mathbb{Q}) = m_d(\mathbb{P},\mathbb{Q}) \le W(\mathbb{P},\mathbb{Q})$ and if $W(\mathbb{P},\mathbb{Q}) = 0$ then $\gamma(\mathbb{P},\mathbb{Q}) = 0$ and $\mathbb{P} = \mathbb{Q}$. Symmetry of $W$ is obvious. It remains to show that $W(\mathbb{P},\mathbb{Q})$ satisfies the triangle inequality. The idea will be rather simple, but to have well-defined conditional distributions we will need to approximate distributions on $S \times S$ with given marginals by more regular distributions with the same marginals. Let us first explain the main idea. Consider three laws $\mathbb{P}, \mathbb{Q}, \mathbb{T}$ on $S$ and let $\mu \in M(\mathbb{P},\mathbb{Q})$ and $\nu \in M(\mathbb{Q},\mathbb{T})$ be such that

$$\int d(x,y)\, d\mu(x,y) \le W(\mathbb{P},\mathbb{Q}) + \varepsilon \quad \text{and} \quad \int d(y,z)\, d\nu(y,z) \le W(\mathbb{Q},\mathbb{T}) + \varepsilon.$$



Let us generate a distribution $\gamma$ on $S \times S \times S$ with marginals $\mathbb{P}$, $\mathbb{Q}$ and $\mathbb{T}$ and marginals on pairs of coordinates $(x,y)$ and $(y,z)$ given by $\mu$ and $\nu$ by "gluing" $\mu$ and $\nu$ in the following way. Let us generate $y$ from distribution $\mathbb{Q}$ and, given $y$, generate $x$ and $z$ according to conditional distributions $\mu(x|y)$ and $\nu(z|y)$ independently of each other, i.e.

$$\gamma(x,z|y) = \mu(x|y) \times \nu(z|y).$$

Obviously, by construction, $(x,y)$ has distribution $\mu$ and $(y,z)$ has distribution $\nu$. Therefore, the marginals of $x$ and $z$ are $\mathbb{P}$ and $\mathbb{T}$ which means that the pair $(x,z)$ has distribution $\eta \in M(\mathbb{P},\mathbb{T})$. Finally,

$$W(\mathbb{P},\mathbb{T}) \le \int d(x,z)\, d\eta(x,z) = \int d(x,z)\, d\gamma(x,y,z) \le \int d(x,y)\, d\gamma + \int d(y,z)\, d\gamma$$

$$= \int d(x,y)\, d\mu + \int d(y,z)\, d\nu \le W(\mathbb{P},\mathbb{Q}) + W(\mathbb{Q},\mathbb{T}) + 2\varepsilon.$$

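Letting $\varepsilon \to 0$ proves the triangle inequality for $W$. It remains to explain how the conditional distributions can be well defined. Let us modify $\mu$ by 'discretizing' it without losing much in the transportation cost integral. Given $\varepsilon > 0$, consider a partition $(S_n)_{n \ge 1}$ of $S$ such that $\mathrm{diameter}(S_n) < \varepsilon$ for all $n$. This can be done as in the proof of Strassen's theorem, Case C. On each box $S_n \times S_m$ of positive $\mu$-measure let

$$\mu^1_{nm}(C) = \frac{\mu\bigl( (C \cap S_n) \times S_m \bigr)}{\mu(S_n \times S_m)}, \qquad \mu^2_{nm}(C) = \frac{\mu\bigl( S_n \times (C \cap S_m) \bigr)}{\mu(S_n \times S_m)}$$

be the marginal distributions of the conditional distribution of $\mu$ on $S_n \times S_m$. Define

$$\mu' = \sum_{n,m} \mu(S_n \times S_m)\, \mu^1_{nm} \times \mu^2_{nm}.$$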

In this construction, locally on each small box $S_n \times S_m$, measure $\mu$ is replaced by the product measure with the same marginals. Let us compute the marginals of $\mu'$. Given a set $C \subseteq S$,

$$\mu'(C \times S) = \sum_{n,m} \mu(S_n \times S_m)\, \mu^1_{nm}(C)\, \mu^2_{nm}(S) = \sum_{n,m} \mu\bigl( (C \cap S_n) \times S_m \bigr) = \sum_n \mu\bigl( (C \cap S_n) \times S \bigr) = \sum_n \mathbb{P}(C \cap S_n) = \mathbb{P}(C).$$

Similarly, $\mu'(S \times C) = \mathbb{Q}(C)$, so $\mu'$ has the same marginals as $\mu$, $\mu' \in M(\mathbb{P},\mathbb{Q})$. It should be obvious that the transportation cost integral does not change much by replacing $\mu$ with $\mu'$. One can visualize this by looking at what happens locally on each small box $S_n \times S_m$. Let $(X_n, Y_m)$ be a random pair with distribution $\mu$ restricted to $S_n \times S_m$ so that



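$$\mathbb{E}\, d(X_n, Y_m) = \frac{1}{\mu(S_n \times S_m)} \int_{S_n \times S_m} d(x,y)\, d\mu(x,y).$$

Let $Y'_m$ be an independent copy of $Y_m$, also independent of $X_n$, i.e. the joint distribution of $(X_n, Y'_m)$ is $\mu^1_{nm} \times \mu^2_{nm}$ and

$$\mathbb{E}\, d(X_n, Y'_m) = \int_{S_n \times S_m} d(x,y)\, d(\mu^1_{nm} \times \mu^2_{nm})(x,y).$$

Then

$$\int d(x,y)\, d\mu(x,y) = \sum_{n,m} \mu(S_n \times S_m)\, \mathbb{E}\, d(X_n, Y_m),$$

$$\int d(x,y)\, d\mu'(x,y) = \sum_{n,m} \mu(S_n \times S_m)\, \mathbb{E}\, d(X_n, Y'_m).$$

Finally, $d(Y_m, Y'_m) \le \mathrm{diam}(S_m) \le \varepsilon$ and these two integrals differ by at most $\varepsilon$. Therefore,

$$\int d(x,y)\, d\mu'(x,y) \le W(\mathbb{P},\mathbb{Q}) + 2\varepsilon.$$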
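The effect of this discretization can be checked exactly for a law with finitely many atoms. The following sketch assumes a hypothetical discrete $\mu$ on a subset of the real line, the partition into cells $[n\varepsilon, (n+1)\varepsilon)$, and $d(x,y) = |x-y|$; it builds the box-wise product measure $\mu'$ and compares marginals and transportation costs using exact rational arithmetic.

```python
from fractions import Fraction as F
from collections import defaultdict

# Hypothetical discrete law mu on S x S, S a subset of the real line, d(x, y) = |x - y|.
mu = {(F(1, 10), F(3, 10)): F(1, 4), (F(2, 10), F(9, 10)): F(1, 4),
      (F(4, 10), F(4, 10)): F(1, 4), (F(7, 10), F(1, 10)): F(1, 8),
      (F(3, 10), F(2, 10)): F(1, 8)}
eps = F(1, 2)
cell = lambda x: x // eps            # index n of the cell S_n = [n*eps, (n+1)*eps)

# Mass of each box S_n x S_m and the (unnormalized) marginals of mu on it.
box, m1, m2 = defaultdict(F), defaultdict(F), defaultdict(F)
for (x, y), w in mu.items():
    b = (cell(x), cell(y))
    box[b] += w
    m1[b, x] += w
    m2[b, y] += w

# mu' = sum over boxes of mu(box) * (mu^1_nm x mu^2_nm)
mu_prime = defaultdict(F)
for (b, x), wx in m1.items():
    for (b2, y), wy in m2.items():
        if b == b2:
            mu_prime[x, y] += wx * wy / box[b]

def marginals(m):
    """First and second marginals of a discrete law on S x S."""
    first, second = defaultdict(F), defaultdict(F)
    for (x, y), w in m.items():
        first[x] += w
        second[y] += w
    return dict(first), dict(second)

cost = lambda m: sum(w * abs(x - y) for (x, y), w in m.items())
same_marginals = marginals(mu) == marginals(mu_prime)
gap = abs(cost(mu) - cost(mu_prime))     # the argument above bounds this by eps
print(same_marginals, gap, eps)
```

The marginals agree exactly, while the cost changes only through the reshuffling of mass inside each box, which is what the bound by $\varepsilon$ expresses.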



Similarly, we can define

$$\nu' = \sum_{n,m} \nu(S_n \times S_m)\, \nu^1_{nm} \times \nu^2_{nm}$$

such that

$$\int d(x,y)\, d\nu'(x,y) \le W(\mathbb{Q},\mathbb{T}) + 2\varepsilon.$$



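We will now show that this special simple form of the distributions $\mu'(x,y)$, $\nu'(y,z)$ ensures that the conditional distributions of $x$ and $z$ given $y$ are well defined. Let $\mathbb{Q}_m$ be the restriction of $\mathbb{Q}$ to $S_m$,

$$\mathbb{Q}_m(C) = \mathbb{Q}(C \cap S_m) = \sum_n \mu(S_n \times S_m)\, \mu^2_{nm}(C).$$

Obviously, if $\mathbb{Q}_m(C) = 0$ then $\mu^2_{nm}(C) = 0$ for all $n$, which means that $\mu^2_{nm}$ are absolutely continuous with respect to $\mathbb{Q}_m$ and the Radon-Nikodym derivatives

$$f_{nm}(y) = \frac{d\mu^2_{nm}}{d\mathbb{Q}_m}(y) \ \text{ exist and } \ \sum_n \mu(S_n \times S_m)\, f_{nm}(y) = 1 \ \text{ a.s. for } y \in S_m.$$

Let us define a conditional distribution of $x$ given $y$ by

$$\mu'(A|y) = \sum_{n,m} \mu(S_n \times S_m)\, f_{nm}(y)\, \mu^1_{nm}(A).$$

Notice that for any $A \in \mathcal{B}$, $\mu'(A|y)$ is measurable in $y$ and $\mu'(\cdot|y)$ is a probability distribution on $\mathcal{B}$, $\mathbb{Q}$-a.s. over $y$ because

$$\mu'(S|y) = \sum_{n,m} \mu(S_n \times S_m)\, f_{nm}(y) = 1 \ \text{ a.s.}$$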
n,m 

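Let us check that for Borel sets $A, B \in \mathcal{B}$,

$$\mu'(A \times B) = \int_B \mu'(A|y)\, d\mathbb{Q}(y).$$

Indeed, since $f_{nm}(y) = 0$ for $y \notin S_m$,

$$\int_B \mu'(A|y)\, d\mathbb{Q}(y) = \sum_{n,m} \mu(S_n \times S_m)\, \mu^1_{nm}(A) \int_B f_{nm}(y)\, d\mathbb{Q}(y)$$

$$= \sum_{n,m} \mu(S_n \times S_m)\, \mu^1_{nm}(A) \int_B f_{nm}(y)\, d\mathbb{Q}_m(y)$$

$$= \sum_{n,m} \mu(S_n \times S_m)\, \mu^1_{nm}(A)\, \mu^2_{nm}(B) = \mu'(A \times B).$$

The conditional distribution $\nu'(\cdot|y)$ can be defined similarly.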

□ 
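For laws with finitely many atoms the conditional distributions are elementary, and the gluing step from the proof of Lemma 41 can be written out directly. The sketch below uses hypothetical couplings `mu` and `nu` on points of the real line with $d(x,y) = |x-y|$; it forms $\gamma(x,y,z) = \mu(x,y)\,\nu(y,z)/\mathbb{Q}(y)$ and checks the pair marginals and the resulting triangle inequality for the costs.

```python
from fractions import Fraction as F
from collections import defaultdict

# Hypothetical couplings on a finite subset of the real line, d(x, y) = |x - y|.
# mu couples P with Q, nu couples Q with T; keys are (x, y) and (y, z).
mu = {(0, 1): F(1, 2), (2, 1): F(1, 4), (2, 3): F(1, 4)}
nu = {(1, 0): F(1, 4), (1, 4): F(1, 2), (3, 4): F(1, 4)}

# Common middle marginal Q(y).
Q = defaultdict(F)
for (x, y), w in mu.items():
    Q[y] += w

# Glue: gamma(x, y, z) = mu(x, y) * nu(y, z) / Q(y),
# i.e. x and z are conditionally independent given y.
gamma = {(x, y, z): w * v / Q[y]
         for (x, y), w in mu.items()
         for (y2, z), v in nu.items() if y == y2}

def push(law, coords):
    """Marginal of a discrete law on the given coordinates."""
    out = defaultdict(F)
    for point, w in law.items():
        out[tuple(point[i] for i in coords)] += w
    return dict(out)

cost = lambda law: sum(w * abs(a - b) for (a, b), w in law.items())
eta = push(gamma, (0, 2))     # law of (x, z), an element of M(P, T)
print(cost(eta), cost(mu) + cost(nu))
```

The cost of `eta` is at most the sum of the costs of `mu` and `nu` because $d(x,z) \le d(x,y) + d(y,z)$ holds pointwise under $\gamma$.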

The next lemma shows that on a separable metric space any law with the "first moment", i.e. $\mathbb{P} \in \mathcal{P}_1(S)$, can be approximated in metrics $W$ and $\gamma$ by laws concentrated on finite sets.

Lemma 42 If $(S,d)$ is separable and $\mathbb{P} \in \mathcal{P}_1(S)$ then there exists a sequence of laws $\mathbb{P}_n$ such that $\mathbb{P}_n(F_n) = 1$ for some finite sets $F_n$ and $W(\mathbb{P}_n,\mathbb{P}), \gamma(\mathbb{P}_n,\mathbb{P}) \to 0$.



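Proof. For each $n \ge 1$, let $(S_{nj})_{j \ge 1}$ be a partition of $S$ such that $\mathrm{diam}(S_{nj}) \le 1/n$. Take a point $x_{nj} \in S_{nj}$ in each set $S_{nj}$ and for $k \ge 1$ define a function

$$f_{nk}(x) = \begin{cases} x_{nj}, & \text{if } x \in S_{nj} \text{ for } j \le k, \\ x_{n1}, & \text{if } x \in S_{nj} \text{ for } j > k. \end{cases}$$

We have,

$$\int d(x, f_{nk}(x))\, d\mathbb{P}(x) = \sum_{j \ge 1} \int_{S_{nj}} d(x, f_{nk}(x))\, d\mathbb{P}(x) \le \frac{1}{n} \sum_{j \le k} \mathbb{P}(S_{nj}) + \int_{S \setminus (S_{n1} \cup \cdots \cup S_{nk})} d(x, x_{n1})\, d\mathbb{P}(x) \le \frac{2}{n}$$

for $k$ large enough because $\mathbb{P} \in \mathcal{P}_1(S)$, i.e. $\int d(x, x_{n1})\, d\mathbb{P}(x) < \infty$, and the set $S \setminus (S_{n1} \cup \cdots \cup S_{nk}) \downarrow \emptyset$.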

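Let $\mu_n$ be the image on $S \times S$ of the measure $\mathbb{P}$ under the map $x \to (f_{nk}(x), x)$ so that $\mu_n \in M(\mathbb{P}_n,\mathbb{P})$ for some $\mathbb{P}_n$ concentrated on the set of points $\{x_{n1}, \ldots, x_{nk}\}$. Finally,

$$W(\mathbb{P}_n,\mathbb{P}) \le \int d(x,y)\, d\mu_n(x,y) = \int d(f_{nk}(x), x)\, d\mathbb{P}(x) \le \frac{2}{n}.$$

Since $\gamma(\mathbb{P}_n,\mathbb{P}) \le W(\mathbb{P}_n,\mathbb{P})$, this finishes the proof.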

□ 
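The map $f_{nk}$ from the proof of Lemma 42 is easy to simulate. The sketch below assumes a hypothetical law on $[0,\infty)$ with geometrically decaying atoms, the partition $S_{nj} = [(j-1)/n, j/n)$ and representatives $x_{nj} = (j-1)/n$; the helper `quantize` (a name introduced here for illustration) returns the image law $\mathbb{P}_n$ and the cost $\int d(x, f_{nk}(x))\, d\mathbb{P}(x)$.

```python
from fractions import Fraction as F
from collections import defaultdict

# Hypothetical law P on [0, infinity): mass 2^-(i+1) at the point i/3, i = 0..29,
# with the leftover mass 2^-30 placed on the last atom so that P is a probability law.
atoms = {F(i, 3): F(1, 2 ** (i + 1)) for i in range(30)}
atoms[F(29, 3)] += F(1, 2 ** 30)

def quantize(P, n, k):
    """Image of P under f_nk for the partition S_nj = [(j-1)/n, j/n), x_nj = (j-1)/n."""
    Pn, cost = defaultdict(F), F(0)
    for x, w in P.items():
        j = x * n // 1 + 1                          # index of the cell containing x
        target = F(j - 1, n) if j <= k else F(0)    # x_nj for j <= k, else x_n1 = 0
        Pn[target] += w
        cost += w * abs(x - target)                 # integral of d(x, f_nk(x)) dP(x)
    return dict(Pn), cost

n, k = 4, 20
Pn, cost = quantize(atoms, n, k)
print(len(Pn), float(cost), 2 / n)
```

With $k = 20$ the cells cover $[0, 5)$; the atoms beyond that carry so little mass that sending them all to $x_{n1}$ costs less than $1/n$, which is the role of the first-moment assumption in the proof.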

We are finally ready to extend Theorem 49 to separable metric spaces. 

Theorem 50 (Kantorovich-Rubinstein) If $(S,d)$ is a separable metric space then for any two distributions $\mathbb{P},\mathbb{Q} \in \mathcal{P}_1(S)$ we have $W(\mathbb{P},\mathbb{Q}) = \gamma(\mathbb{P},\mathbb{Q})$.

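Proof. By the previous lemma, we can approximate $\mathbb{P}$ and $\mathbb{Q}$ by $\mathbb{P}_n$ and $\mathbb{Q}_n$ concentrated on finite (hence, compact) sets. By Theorem 49 and Lemma 40, $W(\mathbb{P}_n,\mathbb{Q}_n) = \gamma(\mathbb{P}_n,\mathbb{Q}_n)$. Finally, since both $W, \gamma$ are metrics,

$$W(\mathbb{P},\mathbb{Q}) \le W(\mathbb{P},\mathbb{P}_n) + W(\mathbb{P}_n,\mathbb{Q}_n) + W(\mathbb{Q}_n,\mathbb{Q})$$

$$= W(\mathbb{P},\mathbb{P}_n) + \gamma(\mathbb{P}_n,\mathbb{Q}_n) + W(\mathbb{Q}_n,\mathbb{Q})$$

$$\le W(\mathbb{P},\mathbb{P}_n) + W(\mathbb{Q}_n,\mathbb{Q}) + \gamma(\mathbb{P}_n,\mathbb{P}) + \gamma(\mathbb{Q}_n,\mathbb{Q}) + \gamma(\mathbb{P},\mathbb{Q}).$$

Letting $n \to \infty$ proves that $W(\mathbb{P},\mathbb{Q}) \le \gamma(\mathbb{P},\mathbb{Q})$. The opposite inequality $\gamma(\mathbb{P},\mathbb{Q}) \le W(\mathbb{P},\mathbb{Q})$ was shown in the proof of Lemma 41.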



□ 



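Wasserstein's distance $W_p(\mathbb{P},\mathbb{Q})$. Given $p \ge 1$, let us define the Wasserstein distance $W_p(\mathbb{P},\mathbb{Q})$ on $\mathcal{P}_p(\mathbb{R}^n) = \{ \mathbb{P} : \int |x|^p\, d\mathbb{P}(x) < \infty \}$ corresponding to the cost function $d(x,y) = |x-y|^p$ by

$$W_p(\mathbb{P},\mathbb{Q})^p := \inf\Bigl\{ \int |x-y|^p\, d\mu(x,y) : \mu \in M(\mathbb{P},\mathbb{Q}) \Bigr\}$$

$$= \sup\Bigl\{ \int f\, d\mathbb{P} + \int g\, d\mathbb{Q} : f(x) + g(y) < |x-y|^p \Bigr\}. \qquad (20.0.2)$$

Even though for $p > 1$ the function $d(x,y)$ is not a metric, equality in (20.0.2) for compactly supported measures $\mathbb{P}$ and $\mathbb{Q}$ follows from the proof of Theorem 49, which does not require that $d$ is a metric. Then one can easily extend (20.0.2) to the entire space $\mathbb{R}^n$. Moreover, $W_p$ is a metric on $\mathcal{P}_p(\mathbb{R}^n)$ which can be shown the same way as in Lemma 41. Namely, given nearly optimal $\mu \in M(\mathbb{P},\mathbb{Q})$ and $\nu \in M(\mathbb{Q},\mathbb{T})$ we can construct $(X,Y,Z) \sim M(\mathbb{P},\mathbb{Q},\mathbb{T})$ such that $(X,Y) \sim \mu$ and $(Y,Z) \sim \nu$ and, therefore,

$$W_p(\mathbb{P},\mathbb{T}) \le \bigl( \mathbb{E}|X-Z|^p \bigr)^{1/p} \le \bigl( \mathbb{E}|X-Y|^p \bigr)^{1/p} + \bigl( \mathbb{E}|Y-Z|^p \bigr)^{1/p} \le \bigl( W_p^p(\mathbb{P},\mathbb{Q}) + \varepsilon \bigr)^{1/p} + \bigl( W_p^p(\mathbb{Q},\mathbb{T}) + \varepsilon \bigr)^{1/p}.$$

Let $\varepsilon \downarrow 0$.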